Finding Similar Files in a Large File System

Author

Udi Manber

Abstract

We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500 MB to 1 GB per hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all files similar to a query file using a preprocessed index. Applications of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection.
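
The abstract does not say how sif detects common pieces, so the following is only a minimal sketch of one way to measure "percentage of common pieces" between two files, not sif's actual method. It assumes that a piece is a fixed-width byte window, that windows are compared by hash, and that only a sample of the hashes is kept. The names `fingerprints`, `containment`, `width`, and `keep_mod` are hypothetical and chosen for illustration.

```python
import hashlib


def fingerprints(data: bytes, width: int = 16, keep_mod: int = 4) -> set[int]:
    """Hash every `width`-byte window of `data` and keep a sample of the hashes.

    Assumption for illustration: a hash is kept when it is divisible by
    `keep_mod`, so the same window is selected (or skipped) in every file.
    """
    selected = set()
    for i in range(len(data) - width + 1):
        digest = hashlib.blake2b(data[i:i + width], digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        if value % keep_mod == 0:
            selected.add(value)
    return selected


def containment(a: bytes, b: bytes, **kwargs) -> float:
    """Fraction of a's selected fingerprints that also occur in b."""
    fa = fingerprints(a, **kwargs)
    fb = fingerprints(b, **kwargs)
    return len(fa & fb) / len(fa) if fa else 0.0


if __name__ == "__main__":
    original = bytes(range(256))
    embedded = b"header " + original + b" footer"      # original contained in a larger file
    reorganized = original[128:] + original[:128]      # same pieces, different order
    print(containment(original, embedded, keep_mod=1))
    print(containment(original, reorganized, keep_mod=1))
```

Under these assumptions, a file embedded in a larger file or a file whose parts have been reordered still shares most of its windows with the original, which matches the two examples of similarity given in the abstract. A threshold such as 25% could then be applied to the returned fraction afterward, without rehashing the files.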

Similar resources

The Zebra Striped

Zebra is a network file system that increases throughput by striping the file data across multiple servers. Rather than striping each file separately, Zebra forms all the new data from each client into a single stream, which it then stripes using an approach similar to a log-structured file system. This provides high performance for writes of small files as well as for reads and writes of large ...

File Annotation and Sharing on Low end Devices in PAN

Rapid development in low-end devices enables many extra features, including huge storage capacity, data sharing, network formation, etc. Users can use these features very efficiently. When the data size is limited, searching for a particular file can be done easily, but difficulty arises when the data size increases and involves a large collection of mobile nodes. In that case arrangement of data, i.e. data ...

Scale and Concurrency of Massive File System Directories

File systems store data in files and organize these files in directories. Over decades, file systems have evolved to handle increasingly large files: they distribute files across a cluster of machines, they parallelize access to these files, they decouple data access from metadata access, and hence they provide scalable file access for high-performance applications. Sadly, most cluster-wide fil...

Exploring the Use of BitTorrent as the Basis for a Large Trace Repository

Motivated by the need to deploy a public repository of multi-gigabyte trace files, we studied the BitTorrent protocol’s ability to disseminate very large files among peers. BitTorrent is a popular peer-to-peer protocol that allows parallel downloads of large files. In this paper, we analyzed user activity on BitTorrent over a four-month period with respect to supportable file sizes, file popula...
